5  Lab Part

School of Accounting, Finance and Economics

BECS2002 / Econometrics and Data Analytics

Lab Handbook

Week 15

Module coordinator: Dr Camilo Calderon

Email: cam.calderon@dmu.ac.uk

This version: 21/02/2025

This handbook has been produced to provide students with specific information and guidance about their labs. The contents of this handbook are indicative: they can be updated or modified upon review by the module leader (Cam).

An electronic version of this handbook (which is continuously updated) is available on the VLE system, Blackboard, which you should consult regularly as the main reference point throughout your studies.

Indicative Contents

Prerequisite

Lecture 1 - The Simple Linear Regression Model

Lecture 2 - Prediction with the Linear Regression Model

Video 1 - Using Indicator Variables in a Regression & Monte Carlo

Video 2 - Interval Estimation

Lab 1 – Hypothesis Test, P-Value and Testing Linear Combination of Parameters

Lab 2 - The p-Value

Lab 3 - Prediction, R-squared, and Modelling

Lab 4 - Linear-Log Models

Lab 5 - Polynomial Models

Lab 6 – Hands-on on your assignment

Prerequisite

Review the assignment brief for the report assessment and download an appropriate dataset.

Students will adapt the R scripts of the next three weeks of labs to the dataset and topic of the report.

Lecture 1 - The Simple Linear Regression Model

Note: students must adapt the following R scripts to their assignment 2 (report) using an appropriate dataset.

Summary: in simple linear regression, we use one independent variable to predict a dependent variable. The relationship is represented by a straight line with an intercept and a slope.

Introduction: In this lecture, we will explore the fundamentals of the Simple Linear Regression Model, a statistical method used to understand the relationship between two variables. This model is foundational for understanding more complex statistical analyses in your future studies.

Key terms: A dependent variable, represented as ‘y’, is what we aim to predict, while an independent variable, denoted as ‘x’, is the variable(s) we use to explain y. Imagine predicting a student’s final grade (y) based on their hours of study (x). Here, the final grade depends on, or is predicted by, the hours of study.

The General Model

The model assumes a linear relationship between the conditional expectation of a dependent variable, y, and an independent variable, x. The dependent variable is sometimes called the ‘response’ or ‘response variable’, and the independent variables ‘regressors.’ The assumed relationship has the form:

$y_i = \beta_1 + \beta_2 x_i + e_i$, (1)

where

y is the dependent variable

x is the independent variable

e is an error term

σ² is the variance of the error term

β1 is the intercept parameter or coefficient

β2 is the slope parameter or coefficient

i stands for the i-th observation in the dataset, i = 1, 2, …, N

N is the number of observations in the dataset

Figure: Example of several observations for any given x

The predicted, or estimated value of y given x is given by the following equation; in general, the hat symbol indicates an estimated or a predicted value.

$\hat{y} = b_1 + b_2 x$ (2)

The model assumes that the values of x are previously chosen (therefore, they are non-random), that the variance of the error term, σ², is the same for all values of x, and that there is no connection between one observation and another (no correlation between the error terms of two observations). In addition, it is assumed that the expected value of the error term for any value of x is zero.

The subscript i in Equation 1 indicates that the relationship applies to each of the N observations. Thus, there must be specific values of y, x, and e for each observation. However, since x is not random, there are, typically, several observations sharing the same x, as the Figure above shows.

Example: Food Expenditure versus Income

The data for this example are stored in the R package PoEdata. (To check whether the package PoEdata is installed, look in the Packages list.)

library(PoEdata)

data(food)

head(food)

It is always a good idea to visually inspect the data in a scatter diagram, which can be created using the function plot(). Figure 2.2 is a scatter diagram of food expenditure on income, suggesting that there is a positive relationship between income and food expenditure.

data("food", package="PoEdata")

plot(food$income, food$food_exp, ylim=c(0, max(food$food_exp)), xlim=c(0, max(food$income)), xlab="weekly income in $100", ylab="weekly food expenditure in $", type = "p")

Estimating a Linear Regression

The R function for estimating a linear regression model is lm(y~x, data). Called on its own at the console, it prints only the call and the estimated coefficients, so it is useful to give the model a name, such as mod1, and then show the full results using summary(mod1). If you are interested in only some of the results of the regression, such as the estimated coefficients, you can retrieve them using specific functions, such as the function coef(). For the food expenditure data, the regression model will be

$food\_exp = \beta_1 + \beta_2\, income + e$ (3)

where the subscript i has been omitted for simplicity.

Figure: A scatter diagram for the food expenditure model

library(PoEdata)

mod1 <- lm(food_exp ~ income, data = food)

b1 <- coef(mod1)[[1]]

b2 <- coef(mod1)[[2]]

smod1 <- summary(mod1)

smod1

## 
## Call:
## lm(formula = food_exp ~ income, data = food)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -223.03  -50.82   -6.32   67.88  212.04 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    83.42      43.41    1.92    0.062 .  
## income         10.21       2.09    4.88 0.000019 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.5 on 38 degrees of freedom
## Multiple R-squared:  0.385,  Adjusted R-squared:  0.369 
## F-statistic: 23.8 on 1 and 38 DF,  p-value: 0.0000195

Figure: Scatter diagram and regression line for the food expenditure model

The function coef() returns a named vector containing the estimated coefficients, from which you can access individual elements. The estimated value of β1 is b1 <- coef(mod1)[[1]], which is equal to 83.416002, and the estimated value of β2 is b2 <- coef(mod1)[[2]], which is equal to 10.209643.

The intercept parameter, β1, usually has little importance in econometric models. The estimated value of β2 suggests that weekly food expenditure for an average family increases by about $10.21 when weekly family income increases by 1 unit, which in this case is $100. See the coefficients in the estimation results table.

The R function abline() adds the regression line to the previously plotted scatter diagram, as the Figure above shows.
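The plotting step described above is not listed in this version of the handbook. A minimal sketch follows. It uses simulated data as a stand-in for food (the names sim_food and mod_sim and all simulation settings are assumptions made for illustration), so its numbers will differ from the output above. With PoEdata installed, replace sim_food with food.

```r
# Simulated stand-in for the 'food' data (hypothetical values, for illustration only)
set.seed(1)
income   <- runif(40, min = 5, max = 35)             # weekly income in $100
food_exp <- 80 + 10 * income + rnorm(40, sd = 90)    # weekly food expenditure in $
sim_food <- data.frame(income, food_exp)

# Estimate the regression and keep the two coefficients
mod_sim <- lm(food_exp ~ income, data = sim_food)
b1 <- coef(mod_sim)[[1]]   # intercept
b2 <- coef(mod_sim)[[2]]   # slope

# Draw the scatter diagram first, then add the fitted line on top of it
plot(sim_food$income, sim_food$food_exp,
     xlab = "weekly income in $100",
     ylab = "weekly food expenditure in $")
abline(a = b1, b = b2, col = "red")   # a = intercept, b = slope
```

abline() only adds to an existing plot, so plot() must be called first.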

How can one retrieve various regression results? These results exist in two R objects produced by the lm() function: the regression object, such as mod1 in the above code sequence, and the regression summary, which I denoted by smod1. The next code shows how to list the names of all results in each object.

##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

To retrieve a particular result, refer to it with the name of the object, followed by the $ sign and the name of the result you wish to retrieve. For instance, if we want the vector of coefficients from mod1 or the coefficient table from smod1, we refer to them as mod1$coefficients and smod1$coefficients:

mod1$coefficients

## (Intercept)      income 
##     83.4160     10.2096

smod1$coefficients

As we have seen before, however, some of these results can be retrieved using specific functions, such as coef(mod1), resid(mod1), fitted(mod1), and vcov(mod1).
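As a quick check that both routes give the same numbers, the sketch below lists the stored results with names() and compares the $ components with the extractor functions. It uses simulated data and the hypothetical names mod and smod so that it runs without PoEdata; the same calls apply to mod1 and smod1.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(1)
x <- runif(40, 5, 35)
y <- 80 + 10 * x + rnorm(40, sd = 90)
mod  <- lm(y ~ x)
smod <- summary(mod)

names(mod)    # results stored in the regression object
names(smod)   # results stored in the summary object

# Extractor functions return the same values as the '$' components
same_coef  <- isTRUE(all.equal(coef(mod),  mod$coefficients))
same_resid <- isTRUE(all.equal(resid(mod), mod$residuals))

# Standard errors are the square roots of the diagonal of vcov()
se_from_vcov    <- sqrt(diag(vcov(mod)))
se_from_summary <- smod$coefficients[, "Std. Error"]
```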

MCQs

What does a simple linear regression model assume about the relationship between the dependent variable y and the independent variable x?

A. A quadratic relationship.

B. A logarithmic relationship.

C. A linear relationship.

D. An exponential relationship.

In the equation , what does represent?

A. The slope of the regression line.

B. The error term.

C. The intercept of the regression line.

D. The dependent variable.

What is indicated by the slope parameter in a simple linear regression model?

A. The variance of the error term.

B. The change in y for a one-unit change in x.

C. The intercept of the regression line.

D. The number of observations in the dataset.

In the context of simple linear regression, what does the term “residuals” refer to?

A. The predicted values of y.

B. The independent variable values.

C. The differences between the observed and predicted values of y.

D. The coefficients of the regression model.

When using the R function lm() for linear regression, what is the purpose of the summary() function?

A. To list the names of all results in the regression object.

B. To change the coefficients of the regression model.

C. To plot the regression line on a scatter plot.

D. To display a detailed summary of the regression results, including coefficients, residuals, and R-squared values.

What is the primary purpose of estimating a linear regression model using random subsamples in R?

A. To change the coefficients of the original model.

B. To assess the variability of the regression coefficients.

C. To increase the number of observations in the dataset.

D. To create a non-linear relationship between variables.

In a simple linear regression model, what does an “Adjusted R-squared” value indicate?

A. The total number of observations in the dataset.

B. The precision of the intercept parameter.

C. The proportion of variance in the dependent variable explained by the model, adjusted for the number of predictors.

D. The standard error of the regression coefficients.

Lecture 2 - Prediction with the Linear Regression Model

The estimated regression parameters, b1 and b2 allow us to predict the expected food expenditure for any given income. All we need to do is to plug the estimated parameter values and the given income into an equation like Equation (2). For example, the expected value of food_exp for an income of $2000 is calculated in Equation (4) below. (Remember to divide the income by 100, since the data for the variable income is in hundreds of dollars.)
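Equation (4) did not survive in this version of the handbook. Substituting the estimates reported in the regression output into Equation (2), with income = 20 (that is, $2000), gives the following reconstruction:

```latex
\widehat{food\_exp} = b_1 + b_2 \cdot income
                    = 83.416 + 10.2096 \times 20
                    = 287.609 \qquad (4)
```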

R does these calculations for us with its function predict(). Let us extend the example to more than one income, say income = $2000, $2500, and $2700. The function predict() requires that the new values of the independent variables be organized in a particular form, called a data frame. In R, a set of numbers is combined into a vector using the function c(). The following sequence shows this example.

## income=$2000        $2500        $2700 
##      287.609      338.657      359.076
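The code that produced the output above is missing from this version. A minimal sketch of the same steps follows, assuming simulated data in place of food (sim_food, mod_sim, newx and yhat are hypothetical names), so the predicted values will not match the ones printed above.

```r
# Simulated stand-in for the 'food' data (hypothetical values)
set.seed(1)
sim_food <- data.frame(income = runif(40, 5, 35))
sim_food$food_exp <- 80 + 10 * sim_food$income + rnorm(40, sd = 90)
mod_sim <- lm(food_exp ~ income, data = sim_food)

# New incomes in units of $100, i.e. $2000, $2500 and $2700.
# The column name must match the regressor name used in lm().
newx <- data.frame(income = c(20, 25, 27))

yhat <- predict(mod_sim, newx)
names(yhat) <- c("income=$2000", "$2500", "$2700")
yhat
```

If the column in the new data frame is not called income, predict() cannot match it to the regressor, which is a common source of errors.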

Repeated Samples to Assess Regression Coefficients

The regression coefficients b1 and b2 are random variables because they depend on the sample. Let us construct a number of random subsamples from the food data and re-calculate b1 and b2. A random subsample can be constructed using the function sample(), as the following example illustrates only for b2.

## [1] 9.88

The result, b2 = 9.88, is the average of 50 estimates of b2.
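The subsampling code itself is not shown in this version. The sketch below illustrates the idea under stated assumptions: the data are simulated rather than taken from food, and the subsample size S = 38 is an assumed value. Only the number of subsamples, 50, comes from the text.

```r
# Simulated stand-in for the 'food' data (hypothetical values)
set.seed(1)
sim_food <- data.frame(income = runif(40, 5, 35))
sim_food$food_exp <- 80 + 10 * sim_food$income + rnorm(40, sd = 90)

N <- nrow(sim_food)   # observations available
C <- 50               # number of subsamples
S <- 38               # subsample size (assumed)
b2_draws <- numeric(C)

for (i in seq_len(C)) {
  rows <- sample(1:N, size = S, replace = TRUE)          # random row numbers
  sub_mod <- lm(food_exp ~ income, data = sim_food[rows, ])
  b2_draws[i] <- coef(sub_mod)[[2]]                      # keep the slope only
}

mean(b2_draws)   # average of the 50 slope estimates
sd(b2_draws)     # spread of the slope across subsamples
```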

Estimated Variances and Covariance of Regression Coefficients

Many applications require estimates of the variances and covariances of the regression coefficients. R returns them as a matrix through the function vcov():

(varb1 <- vcov(mod1)[1, 1])

## [1] 1884.44

(varb2 <- vcov(mod1)[2, 2])

## [1] 4.38175

(covb1b2 <- vcov(mod1)[1,2])

## [1] -85.9032
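These variances connect directly to the standard errors in the regression output: a standard error is the square root of the corresponding variance. The sketch below checks this using the three numbers printed above; the name corr_b1b2 is introduced here for illustration.

```r
# Values copied from the vcov() output above
varb1   <- 1884.44
varb2   <- 4.38175
covb1b2 <- -85.9032

seb1 <- sqrt(varb1)   # about 43.41, the Std. Error of the intercept
seb2 <- sqrt(varb2)   # about 2.09, the Std. Error of income

# Correlation between the two estimators
corr_b1b2 <- covb1b2 / (seb1 * seb2)

# With a fitted model, sqrt(diag(vcov(mod1))) returns both standard errors at once
```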

Non-Linear Relationships

Sometimes the scatter plot diagram or some theoretical considerations suggest a non-linear relationship. The most popular non-linear relationships involve logarithms of the dependent or independent variables and polynomial functions.

The quadratic model requires the square of the independent variable.

$y_i = \beta_1 + \beta_2 x_i^2 + e_i$ (5)

In R, independent variables involving mathematical operators can be included in a regression equation with the function I(). The following example uses the dataset br from the package PoEdata, which includes the sale prices and the attributes of 1080 houses in Baton Rouge, LA. price is the sale price in dollars, and sqft is the surface area in square feet.

Figure 2.4: Fitting a quadratic model to the ‘br’ dataset

elasticity=DpriceDsqft*sqftx/pricex

b1; b2; DpriceDsqft; elasticity #prints results

## [1] 55776.6

## [1] 0.0154213

## [1] 61.6852 123.3704 185.0556

## [1] 1.05030 1.63125 1.81741

Marginal Effect: It measures the change in the dependent variable (e.g., price) for a unit change in an independent variable (e.g., sqft).

Elasticity: It measures the percentage change in the dependent variable (e.g., price) for a one percent change in an independent variable (e.g., sqft). For instance, if the elasticity of price with respect to sqft is 0.5, a 1% increase in sqft leads to a 0.5% increase in price. If the elasticity is less than 1, changes in sqft have a proportionally smaller effect on price, which might imply that other factors are also important in determining the price. Unlike in linear models, in non-linear models the elasticity can change at different levels of the independent variable.
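The lines that define sqftx, pricex and DpriceDsqft are missing from the code fragment above. The sketch below reconstructs them from the printed results. The coefficients are the values printed above; the house sizes 2000, 4000 and 6000 sqft are an assumption, inferred because they reproduce the printed marginal effects.

```r
# Coefficients of price = b1 + b2 * sqft^2, as printed above
b1 <- 55776.6
b2 <- 0.0154213

# House sizes at which to evaluate the model (inferred, see lead-in)
sqftx <- c(2000, 4000, 6000)

# Predicted price at each size
pricex <- b1 + b2 * sqftx^2

# Marginal effect: derivative of price with respect to sqft, 2 * b2 * sqft
DpriceDsqft <- 2 * b2 * sqftx

# Elasticity: marginal effect times sqft / price
elasticity <- DpriceDsqft * sqftx / pricex

DpriceDsqft
elasticity
```

Both quantities grow with sqft, which illustrates the point that in a non-linear model the marginal effect and the elasticity depend on where they are evaluated.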

Draw a scatter diagram and see how the quadratic function fits the data. The next chunk of code provides two alternatives for constructing such a graph. The first simply draws the quadratic function on the scatter diagram, using the R function curve(); the second uses the function lines, which requires ordering the dataset in increasing values of sqft before the regression model is evaluated, such that the resulting fitted values will also come out in the same order.

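data("br", package="PoEdata")

mod31 <- lm(price~I(sqft^2), data=br)

b1 <- coef(mod31)[[1]] # intercept

b2 <- coef(mod31)[[2]] # coefficient on sqft^2

plot(br$sqft, br$price, xlab="Total square feet", ylab="Sale price, $", col="grey")

# add the quadratic curve to the scatter plot:

curve(b1+b2*x^2, col="red", add=TRUE)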

An alternative way to draw the fitted curve:
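The code for this second alternative is not included in this version. A minimal sketch follows, assuming simulated house data in place of br (sim_br, ordat and mod_q are hypothetical names, and the simulated prices are not real). The key step is sorting the data by sqft before fitting, so that the fitted values come out in plotting order.

```r
# Simulated stand-in for the 'br' data (hypothetical values)
set.seed(1)
sim_br <- data.frame(sqft = runif(200, 800, 6000))
sim_br$price <- 55000 + 0.015 * sim_br$sqft^2 + rnorm(200, sd = 40000)

# Order the dataset by increasing sqft, then estimate the quadratic model
ordat <- sim_br[order(sim_br$sqft), ]
mod_q <- lm(price ~ I(sqft^2), data = ordat)

# Scatter diagram, then the fitted values joined by a line
plot(sim_br$sqft, sim_br$price, xlab = "Total square feet",
     ylab = "Sale price, $", col = "grey")
lines(fitted(mod_q) ~ ordat$sqft, col = "red")
```

Without the ordering step, lines() would join the fitted values in the order of the rows and produce a zigzag instead of a smooth curve.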

Figure: A comparison between the histograms of ‘price’ and ‘log(price)’

The log-linear model regresses the log of the dependent variable on a linear expression of the independent variable (unless otherwise specified, the log notation stands for natural logarithm, following a usual convention in economics):

$\log(y_i) = \beta_1 + \beta_2 x_i + e_i$ (6)

One of the reasons to use the log of a variable is to make its distribution closer to the normal distribution. Let us draw the histograms of price and log(price) to compare them (see the last Figure above). The distribution of log(price) is noticeably closer to the normal distribution.

hist(br$price, col='grey')

hist(log(br$price), col='grey')

We are interested, as before, in the estimates of the coefficients and their interpretation, in the fitted values of price, and in the marginal effect of an increase in sqft on price.

The coefficients are b1 = 10.84 and b2 = 0.00041, showing that an increase in the surface area (sqft) of a house by one unit (1 sqft) increases the price of the house by about 0.041 percent. Thus, for a house price of $100,000, an increase of 100 sqft will increase the price by approximately 100 ∗ 0.041 percent, which is equal to $4112.7. In general, the marginal effect of an increase in x on y in Equation 6 is

$\frac{dy}{dx} = \beta_2 y$, (7)

and the elasticity is

$\varepsilon = \frac{dy}{dx}\,\frac{x}{y} = \beta_2 x$. (8)

The next lines of code show how to draw the fitted values curve of the log-linear model and how to calculate the marginal effect and the elasticity for the median price in the dataset. The fitted values are here calculated using the formula

$\hat{y} = e^{b_1 + b_2 x}$. (9)

ordat <- br[order(br$sqft), ] #order the dataset

mod4 <- lm(log(price)~sqft, data=ordat)

plot(br$sqft, br$price, col="grey", main="Log-linear Model")

lines(exp(fitted(mod4))~ordat$sqft, col="blue")

pricex<- median(br$price)

sqftx <- (log(pricex)-coef(mod4)[[1]])/coef(mod4)[[2]]

(DyDx <- pricex*coef(mod4)[[2]])

## [1] 53.465

(elasticity <- sqftx*coef(mod4)[[2]])

## [1] 0.936693

R allows us to calculate the same quantities for several (sqft, price) pairs at a time, as shown in the following sequence:
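The sequence referred to here is missing from this version. The sketch below shows one way to do it, relying on the fact that R arithmetic works element by element on vectors. The coefficients are the rounded values quoted in the text; the three house sizes are hypothetical choices.

```r
# Rounded log-linear coefficients quoted in the text
b1 <- 10.84
b2 <- 0.00041

# A few house sizes (hypothetical choices)
sqftx <- c(2000, 3000, 4000)

# Fitted prices: exponentiate the fitted log price
pricex <- exp(b1 + b2 * sqftx)

# Marginal effects (b2 * price) and elasticities (b2 * sqft), one per pair
DyDx       <- b2 * pricex
elasticity <- b2 * sqftx

data.frame(sqftx, pricex, DyDx, elasticity)
```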

Figure: The fitted value curve in the log-linear model

MCQs

1. What is the primary purpose of using the estimated regression parameters b1 and b2 in a linear regression model?

A. To calculate the mean of the dependent variable.

B. To predict the expected value of a dependent variable for a given independent variable.

C. To determine the correlation between independent and dependent variables.

D. To assess the distribution of the independent variable.

2. In the context of R programming, what is the function of the lm() function when used in linear regression analysis?

A. To plot the regression line on a scatter plot.

B. To create a new data frame.

C. To estimate a linear model.

D. To calculate the mean of a dataset.

3. What does the predict() function in R primarily do in the context of linear regression?

A. It calculates the variance of the regression coefficients.

B. It estimates the dependent variable values based on the regression model and new data.

C. It generates random samples for assessing regression coefficients.

D. It plots a scatter diagram for visual inspection.

4. Why are the regression coefficients b1 and b2 considered random variables?

A. Because they vary depending on the chosen statistical test.

B. Because they are calculated using random number generators.

C. Because they depend on the specific sample used in the regression.

D. Because they change with each iteration of the regression model.

5. What is the purpose of using random subsamples in the assessment of regression coefficients?

A. To visualize the data distribution.

B. To predict values for a new dataset.

C. To evaluate the stability and variability of the coefficients.

D. To calculate the mean and median of the dataset.

6. In the context of regression analysis, what is the vcov() matrix in R used for?

A. For storing predicted values.

B. For estimating variances and covariances of regression coefficients.

C. For plotting regression lines.

D. For generating random subsamples.

7. What is the significance of using the data.frame() function in conjunction with the predict() function in R?

A. To format the independent variable values appropriately for prediction.

B. To calculate the correlation coefficient between variables.

C. To store the regression coefficients.

D. To create a scatter plot of the data.

Answer Key: 1 B, 2 C, 3 B, 4 C, 5 C, 6 B, 7 A

Video 1 - Using Indicator Variables in a Regression & Monte Carlo

An indicator or binary variable marks the presence or the absence of some attribute, such as gender or race if the observational unit is an individual, or location if the observational unit is a house. In the dataset utown, the variable utown is 1 if a house is close to the university and 0 otherwise. Here is a simple linear regression model that involves the variable utown:

$price_i = \beta_1 + \beta_2\, utown_i$ (10)

The coefficient of such a variable in a simple linear model is equal to the difference between the average prices of the two categories; the intercept coefficient of the model in Equation 10 is equal to the average price of the houses that are not close to the university. Let us calculate the average prices for each category, which are denoted price0bar and price1bar in the following sequence of code:

# Load the 'utown' dataset, which contains housing data, including proximity to a university
data("utown", package="PoEdata")

# Calculate the average price of houses not close to the university
price0bar <- mean(utown$price[which(utown$utown==0)])

# Calculate the average price of houses close to the university
price1bar <- mean(utown$price[which(utown$utown==1)])

The average prices are price1bar = 277.24 for houses close to the university and price0bar = 215.73 for those not close. The coefficients of the regression model in Equation 10 yield the same results:

# Perform linear regression with 'utown' as an indicator variable
mod5 <- lm(price~utown, data=utown)

# Extract coefficients from the regression model
b1 <- coef(mod5)[[1]] # intercept

b2 <- coef(mod5)[[2]] # slope for the indicator variable

The results are: price0bar = b1 = 215.73 for non-university houses, and price1bar = b1 + b2 = 277.24 for university houses.

Monte Carlo Simulation

A Monte Carlo simulation generates random values for the dependent variable when the regression coefficients and the distribution of the random term are given. The following example seeks to determine the distribution of the dependent variable in the food expenditure model in Equation 3.

Figure: The theoretical (true) probability distributions of food expenditure, given two levels of income

Next, we calculate the variance of b2 and plot the corresponding density function.
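The setup code for this step is missing from this version. The sketch below shows one possible setup. Only b1 = 100 and b2 = 10 are taken from the text; the sample size N = 40, the two income levels x1 = 10 and x2 = 20, and the error variance of 2500 are assumptions, so later numbers need not match the ones reported in the handbook. The variance of b2 uses the standard formula: error variance divided by the sum of squared deviations of x from its mean.

```r
N     <- 40      # sample size (assumed)
x1    <- 10      # first income level (assumed)
x2    <- 20      # second income level (assumed)
b1    <- 100     # true intercept, as in the text
b2    <- 10      # true slope, as in the text
sig2e <- 2500    # error variance (assumed)
sde   <- sqrt(sig2e)

# Half of the observations at each income level
x <- c(rep(x1, N / 2), rep(x2, N / 2))

# Theoretical variance and standard deviation of b2
xbar  <- mean(x)
sumx2 <- sum((x - xbar)^2)
varb2 <- sig2e / sumx2
sdb2  <- sqrt(varb2)

# Theoretical density of b2: normal, centred on the true slope
curve(dnorm(x, mean = b2, sd = sdb2),
      from = b2 - 4 * sdb2, to = b2 + 4 * sdb2,
      xlab = "b2", ylab = "density")
```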

Now, with the same values of b1, b2, and error standard deviation, we can generate a set of values for y, regress y on x, and calculate an estimated value for the coefficient b2 and its standard error.

set.seed(12345)

y <- b1+b2*x+rnorm(N, mean=0, sd=sde)

mod6 <- lm(y~x)

b1hat <- coef(mod6)[[1]]

b2hat <- coef(mod6)[[2]]

mod6summary <- summary(mod6) # the summary contains the standard errors

seb2hat <- coef(mod6summary)[2,2]

The results are b2 = 11.64 and se(b2) = 1.64. The strength of a Monte Carlo simulation is the possibility of repeating the estimation of the regression parameters for a large number of automatically generated samples. Thus, we can obtain a large number of values for a parameter, say b2, and then determine its sampling characteristics. For instance, if the mean of these values is close to the initially assumed value b2 = 10, we conclude that our estimator (the method of estimating the parameter) is unbiased.

This time we use the values of x in the food dataset, and generate y using the linear model with b1 = 100 and b2 = 10.

The mean and standard deviation of the estimated 40 values of b2 are, respectively, 9.974985 and 1.152632.
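The simulation loop is not listed in this version. A minimal sketch of the idea follows. It assumes simulated x values instead of food$income, an error standard deviation of 50 and 1000 replications (all three are assumptions), so its results will differ from the figures quoted above.

```r
set.seed(12345)

N     <- 40
x     <- runif(N, 5, 35)   # hypothetical regressor values; the text uses food$income
b1    <- 100               # true intercept, as in the text
b2    <- 10                # true slope, as in the text
sde   <- 50                # error standard deviation (assumed)
nrsim <- 1000              # number of simulated samples (assumed)

vb2 <- numeric(nrsim)      # stores the estimates of b2

for (i in seq_len(nrsim)) {
  y <- b1 + b2 * x + rnorm(N, mean = 0, sd = sde)   # new random sample
  vb2[i] <- coef(lm(y ~ x))[[2]]                    # estimated slope
}

mean(vb2)   # close to 10 if the estimator is unbiased
sd(vb2)     # simulated standard error of b2
```

A histogram of vb2, for example hist(vb2), gives the simulated distribution that the next figure compares with the theoretical one.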

The following figure shows the simulated distribution of b2 and the theoretical one.

Figure: The simulated and theoretical distributions of b2

Video 2 - Interval Estimation

Understanding Interval Estimation:

Interval estimation involves finding a range (interval) within which we expect the true value of a parameter, like a regression coefficient, to lie.

Unlike a single estimate, an interval gives a range of plausible values, offering a better understanding of the estimate’s precision.

library(xtable)

library(PoEdata)

library(knitr)

So far we have estimated only a single number for a regression parameter such as β2. This estimate gives no indication of its reliability, since it is just one realization of the random variable b2. An interval estimate, also known as a confidence interval, is an interval centered on an estimated value, which includes the true parameter with a given probability, say 95%. A coefficient estimator of the linear regression model such as b2 is normally distributed, with its mean equal to the population parameter β2 and a variance that depends on the population variance σ² and the sample size.

The Estimated Distribution of Regression Coefficients

In regression, coefficients (like b2) are not fixed numbers but random variables because they vary depending on the sample data.

We often use the estimated coefficient (like b2 from our sample data) to guess the true population parameter (like β2).

Equation 1 gives the theoretical distribution of a linear regression coefficient, a distribution that is not very useful in practice since it requires the unknown population variance σ². If we replace σ² with the estimated variance σ̂² given in Equation 2, the standardized distribution of b2 becomes a t distribution with N − 2 degrees of freedom.

Equation 3 shows the t-ratio:
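The three equations referred to in this section did not survive in this version of the handbook. Their standard textbook forms for the simple linear regression model, given here as a reconstruction, are:

```latex
b_2 \sim N\!\left(\beta_2,\; \frac{\sigma^2}{\sum_{i=1}^{N}(x_i-\bar{x})^2}\right) \qquad (1)

\hat{\sigma}^2 = \frac{\sum_{i=1}^{N}\hat{e}_i^{\,2}}{N-2} \qquad (2)

t = \frac{b_2-\beta_2}{\mathrm{se}(b_2)} \sim t_{(N-2)} \qquad (3)
```

Here $\hat{e}_i$ are the regression residuals and $\mathrm{se}(b_2)$ is the standard error of $b_2$, obtained by replacing $\sigma^2$ with $\hat{\sigma}^2$ in the variance in Equation 1 and taking the square root.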

Confidence Interval in General

A confidence interval includes the true parameter with a certain probability, such as 95%.

It’s a way of saying, “We are 95% confident that the true value lies within this range.”

The confidence interval for a coefficient is calculated using its estimated value, the standard error, and the t-distribution (since we typically don’t know the population variance).

The standard error measures the variability of the estimate.

An interval estimate of β2 based on the t-ratio, b2 ± tc·se(b2), is calculated in Equation 4, which we can consider as “an interval that includes the true parameter β2 with a probability of 100(1 − α)%.” In this context, α is called the significance level, and the interval is called, for example, a 95% confidence interval estimate for β2 when α = 0.05. The critical value of the t-ratio, tc, depends on the chosen significance level and on the number of degrees of freedom. In R, the function that returns critical values for the t distribution is qt(1 − α/2, df), where df is the number of degrees of freedom.

A side note about using distributions in R. There are four types of functions related to distributions, each type’s name beginning with one of the following four letters: p for the cumulative distribution function, d for density, r for a draw of a random number from the respective distribution, and q for quantile. This first letter is followed by a few letters suggesting what distribution we refer to, such as norm, t, f, and chisq. Now, if we put together the first letter and the distribution name, we get functions such as the following, where x and q stand for quantiles, p stands for probability, df is degree of freedom (of which F has two), n is the desired number of draws, and lower.tail can be TRUE (default) if probabilities are P[X ≤ x] or FALSE if probabilities are P[X > x]:

• For the uniform distribution:

dunif(x, min = 0, max = 1)

punif(q, min = 0, max = 1, lower.tail = TRUE)

qunif(p, min = 0, max = 1, lower.tail = TRUE)

runif(n, min = 0, max = 1)

• For the normal distribution:

dnorm(x, mean = 0, sd = 1)

pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)

qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)

rnorm(n, mean = 0, sd = 1)

• For the t distribution:

dt(x, df)

pt(q, df, lower.tail = TRUE)

qt(p, df, lower.tail = TRUE)

rt(n, df)

• For the F distribution:

df(x, df1, df2)

pf(q, df1, df2, lower.tail = TRUE)

qf(p, df1, df2, lower.tail = TRUE)

rf(n, df1, df2)

• For the χ2 distribution:

dchisq(x, df)

pchisq(q, df, lower.tail = TRUE)

qchisq(p, df, lower.tail = TRUE)

rchisq(n, df)
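A few calls make the naming scheme concrete. The sketch below uses only base R functions from the lists above; the object names and the chosen argument values are illustrative.

```r
# q-functions invert p-functions
z975   <- qnorm(0.975)    # quantile with 97.5% of the standard normal below it
p_back <- pnorm(z975)     # returns 0.975 again

# An upper-tail probability computed in two equivalent ways
upper1 <- 1 - pt(2, df = 38)
upper2 <- pt(2, df = 38, lower.tail = FALSE)

# Critical value for a two-sided 95% interval with 38 degrees of freedom
tc <- qt(1 - 0.05 / 2, df = 38)

# d-functions give the height of the density; r-functions draw random numbers
height <- dnorm(0)        # equals 1 / sqrt(2 * pi)
set.seed(1)
draws <- rnorm(5)         # five draws from the standard normal
```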

Example: Confidence Intervals in the food Model

Using R programming, you can calculate a confidence interval for a regression coefficient.

For example, to find a 95% confidence interval for the coefficient on income in a food expenditure model, you would:

Estimate the regression model (lm(food_exp~income, data=food)).

Calculate the standard error of the coefficient (seb2 <- coef(smod1)[2, 2]).

Find the critical value from the t-distribution (tc <- qt(1-alpha/2, df)).

Calculate the lower and upper bounds of the interval (lowb <- b2 - tc*seb2 and upb <- b2 + tc*seb2).

Calculate a 95% confidence interval for the coefficient on income in the food expenditure model. Besides calculating confidence intervals, the following lines of code show how to retrieve information such as standard errors of coefficients from the summary() output. The function summary summarizes the results of a linear regression, some of which are not available directly from running the model itself.

The resulting confidence interval for the coefficient b2 in the food simple regression model is (5.97,14.45).
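The code for this calculation is missing from this version. The sketch below reproduces the interval from numbers already reported in the handbook (the slope estimate, its variance from vcov(), and the 38 degrees of freedom), so it runs without the dataset. The name df_res is introduced here to avoid reusing the name of R's df() function.

```r
# Values reported earlier in the handbook
b2     <- 10.209643   # estimated coefficient on income
varb2  <- 4.38175     # estimated variance of b2, from vcov(mod1)[2, 2]
df_res <- 38          # residual degrees of freedom, N - 2

alpha <- 0.05                      # for a 95% confidence interval
seb2  <- sqrt(varb2)               # standard error of b2
tc    <- qt(1 - alpha / 2, df_res) # critical value of the t distribution

lowb <- b2 - tc * seb2   # lower bound
upb  <- b2 + tc * seb2   # upper bound
c(lowb, upb)
```

With a fitted model, the same ingredients come from coef(smod1)[2, 2] for the standard error and df.residual(mod1) for the degrees of freedom.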

R has a special function, confint(model), that can calculate confidence intervals taking as its argument the name of a regression model. The result of applying this function is a K×2 matrix with a confidence interval (two values: lower and upper bound) on each row and a number of lines equal to the number of parameters in the model (equal to 2 in the simple linear regression model). Compare the values from the next code to the ones from the previous to check that they are equal.

ci <- confint(mod1)

print(ci)

##                2.5 %   97.5 %
## (Intercept) -4.46328 171.2953
## income       5.97205  14.4472

lowb_b2 <- ci[2, 1] # lower bound

upb_b2 <- ci[2, 2] # upper bound.

Confidence Intervals in Repeated Samples

The Table shows the lower and upper bounds of the confidence intervals of β1 and β2.

Table: Confidence intervals for b1 and b2

Key Points:

Intervals vs. Single Estimates: Confidence intervals provide more information than just a point estimate by showing a range where the true value likely falls.

Significance Level (α): The choice of α (like 0.05 for a 95% confidence interval) affects the interval’s width – lower α means a wider interval.

Interpreting Confidence Intervals: If an interval for a coefficient does not include 0, it suggests the coefficient is significantly different from 0, indicating a potential relationship between the variables.

Multiple Choice Questions:

  1. What does an interval estimate in regression analysis typically provide?
  1. A single fixed value of a parameter

  2. A range within which the true parameter value is likely to fall

  3. The exact value of the population variance

  4. The probability of the sample mean being accurate

  2. Why are regression coefficients considered random variables?
  1. They change with each new sample

  2. They are always equal to zero

  3. They remain constant across different samples

  4. They can only take integer values

  3. A 95% confidence interval implies that:
  1. The true parameter will always fall within this range

  2. There’s a 95% chance that the interval contains the true parameter

  3. 95% of the sample data falls within this interval

  4. The interval is 95 units wide

  4. What is required to calculate a confidence interval for a regression coefficient?
  1. Only the mean of the coefficient

  2. The estimated coefficient and its standard error

  3. The population variance

  4. A z-distribution table

  5. In the context of confidence intervals, what does the term ‘α’ represent?
  1. Coefficient of determination

  2. Significance level

  3. Mean of the distribution

  4. Degree of freedom

  6. What does the function qt(1-alpha/2, df) in R return?
  1. The mean of the t-distribution

  2. The critical value from the t-distribution

  3. The standard error of the estimate

  4. The p-value for the test

  7. What would be an indication that a regression coefficient is statistically significant?
  1. Its confidence interval includes 0

  2. Its confidence interval does not include 0

  3. The interval is very narrow

  4. The standard error is zero

Answer Key:

  1. A range within which the true parameter value is likely to fall

  2. They change with each new sample

  3. There’s a 95% chance that the interval contains the true parameter

  4. The estimated coefficient and its standard error

  5. Significance level

  6. The critical value from the t-distribution

  7. Its confidence interval does not include 0

Lab 1 – Hypothesis Test, P-Value and Testing Linear Combination of Parameters

Hypothesis Tests

Hypothesis testing seeks to establish whether the data sample at hand provides sufficient evidence to support a certain conjecture (hypothesis) about a population parameter such as the intercept in a regression model, the slope, or some combination of them. The procedure requires three elements: the hypotheses (the null and the alternative), a test statistic, which in the case of the simple linear regression parameters is the t-ratio, and a significance level, α.

Suppose we believe that there is a significant relationship between a household’s income and its expenditure on food, a conjecture which has led us to formulate the food expenditure model in the first place. Thus, we believe that β2, the (population) parameter, is different from zero. Equation 5 shows the null and alternative hypotheses for such a test.

In general, if a null hypothesis H0 : βk = c is true, the t statistic (the t-ratio) is given by Equation 6 and has a t distribution with N − 2 degrees of freedom.
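Equations 5 and 6 did not survive in this version of the handbook. Based on the surrounding text, their standard forms are, as a reconstruction:

```latex
H_0: \beta_2 = 0, \qquad H_A: \beta_2 \neq 0 \qquad (5)

t = \frac{b_k - c}{\mathrm{se}(b_k)} \sim t_{(N-2)} \qquad (6)
```

Setting $c = 0$ and $k = 2$ in Equation 6 gives the t-ratio that R reports in the regression output.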

Test the hypothesis in Equation 5, which makes c = 0 in Equation 6. Let α = 0.05. The following Table shows the regression output.

Table: Regression output showing the coefficients

The results t = 4.88 and tcr = 2.02 show that t > tcr, and therefore t falls in the rejection region (see the following Figure).

# Plot the density function and the values of t
# (seb2, df, tcr and t are taken from the estimated food model)
curve(dt(x, df), -2.5*seb2, 2.5*seb2, ylab=" ", xlab="t")
abline(v=c(-tcr, tcr, t), col=c("red", "red", "blue"), lty=c(2, 2, 3))
legend("topleft", legend=c("-tcr", "tcr", "t"), col=c("red", "red", "blue"), lty=c(2, 2, 3))

Figure: A two-tail hypothesis testing for b2 in the food example
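The comparison of t with tcr can be reproduced from the numbers in the regression output alone. The following is a minimal sketch, not the handbook's original code: it assumes the rounded values b2 = 10.21 and se(b2) = 2.09 and the 38 degrees of freedom reported for the food model, and the names t_stat and c0 are illustrative choices (t_stat is used so that the base R function t() is not masked).

```r
# Two-tail test of H0: beta2 = 0 using values read off the regression output
b2    <- 10.21   # estimated slope (rounded)
seb2  <- 2.09    # standard error of b2 (rounded)
df    <- 38      # degrees of freedom, N - 2
alpha <- 0.05
c0    <- 0       # hypothesised value of beta2 under H0

t_stat <- (b2 - c0) / seb2       # the t-ratio of Equation 6
tcr    <- qt(1 - alpha / 2, df)  # critical value, alpha/2 in each tail
reject <- abs(t_stat) > tcr      # TRUE means t falls in the rejection region

cat("t =", round(t_stat, 2), " tcr =", round(tcr, 2), " reject H0:", reject, "\n")
```

Because the inputs are rounded, t_stat is close to, but not exactly equal to, the value quoted in the text.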

Suppose we want to determine whether β2 is greater than 5.5. This conjecture goes into the alternative hypothesis: H0 : β2 ≤ 5.5, HA : β2 > 5.5. The procedure is the same as for the two-tail test, but now the whole rejection region is to the right of the critical value tcr.

The Figure below shows tcr = 1.685954 and t = 2.249904. Since t again falls in the rejection region, we reject the null hypothesis H0 : β2 ≤ 5.5.

Figure: Right-tail test: the rejection region is to the right of tcr

A left-tail test is no different from the right-tail one, except that the rejection region is to the left of tcr. For example, if we want to determine whether β2 is less than 15, we place this conjecture in the alternative hypothesis: H0 : β2 ≥ 15, HA : β2 < 15. The novelty here is how we use the qt() function to calculate tcr: instead of qt(1-alpha, …), we need to use qt(alpha, …). The Figure below illustrates this example, where the rejection region is, remember, to the left of tcr.
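The two one-tail tests differ only in which quantile of the t distribution is requested. Here is a short sketch under the same assumptions as before (rounded b2 = 10.21 and se(b2) = 2.09, 38 degrees of freedom); the variable names are illustrative.

```r
b2 <- 10.21; seb2 <- 2.09; df <- 38; alpha <- 0.05

# Right-tail test: H0: beta2 <= 5.5 against HA: beta2 > 5.5
t_right      <- (b2 - 5.5) / seb2
tcr_right    <- qt(1 - alpha, df)   # all of alpha sits in the right tail
reject_right <- t_right > tcr_right

# Left-tail test: H0: beta2 >= 15 against HA: beta2 < 15
t_left      <- (b2 - 15) / seb2
tcr_left    <- qt(alpha, df)        # all of alpha sits in the left tail
reject_left <- t_left < tcr_left

cat("right tail: t =", round(t_right, 3), "tcr =", round(tcr_right, 3), "\n")
cat("left tail:  t =", round(t_left, 3), "tcr =", round(tcr_left, 3), "\n")
```

Since the t distribution is symmetric, the left-tail critical value is simply the negative of the right-tail one.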

R automatically performs a test of significance, which tests the hypothesis H0 : β2 = 0, HA : β2 ≠ 0. The regression output shows the values of the t-ratio for all the regression coefficients.

library(PoEdata) # provides the food dataset
library(xtable)
library(knitr) # provides kable()
data("food")
mod1 <- lm(food_exp ~ income, data = food)
table <- data.frame(round(xtable(summary(mod1)), 3))
kable(table, caption = "Regression output for the 'food' model")

Figure: Left-tail test: the rejection region is to the left of tcr

The Table below shows the regression output where the t-statistics of the coefficients can be observed.

Table: Regression output for the ’food’ model

Lab 2 -The p-Value

In the context of a hypothesis test, the p-value is the area outside the calculated t-statistic; it is the probability that the t-ratio takes a value that is more extreme than the calculated one, under the assumption that the null hypothesis is true. We reject the null hypothesis if the p-value is less than a chosen significance level. For a right-tail test, the p-value is the area to the right of the calculated t; for a left-tail test it is the area to the left of the calculated t; for a two-tail test the p-value is split in two equal amounts: p/2 to the left and p/2 to the right. In R, p-values are calculated with the function pt(t, df), which returns the area to the left of t under the t distribution; t is the calculated t-ratio and df is the number of degrees of freedom in the estimated model.

Right-tail test, H0 : β2 ≤ c, HA : β2 > c.

# Calculating the p-value for a right-tail test
c <- 5.5
t <- (b2-c)/seb2
p <- 1-pt(t, df) # pt() returns the area to the left of t, so 1-pt() is the right-tail area

The right-tail test shown in the previous lab gives the p-value p = 0.01516.

Left-tail test, H0 : β2 ≥ c, HA : β2 < c.

The left-tail test shown in the previous Lab gives the p-value p = 0.01388.

Two-tail test, H0 : β2 = c, HA : β2 ≠ c.

The two-tail test shown in the Figure below gives the p-value p = 2 × 10⁻⁵, for a t-ratio t = 4.88.

curve(dt(x, df), from=-2.5*seb2, to=2.5*seb2)
abline(v=c(-t, t), col=c("blue", "blue"), lty=c(2, 2))
legend("topright", legend=c("-t", "t"), col=c("blue", "blue"), lty=c(2, 2))

Figure: The p-value in two-tail hypothesis testing
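The three p-values discussed in this lab can be computed side by side. The sketch below again assumes the rounded values b2 = 10.21 and se(b2) = 2.09 with 38 degrees of freedom, so its results agree with the quoted p-values only approximately; all variable names are illustrative.

```r
b2 <- 10.21; seb2 <- 2.09; df <- 38

# Right-tail test, c = 5.5: area to the right of t
t_right <- (b2 - 5.5) / seb2
p_right <- 1 - pt(t_right, df)

# Left-tail test, c = 15: area to the left of t
t_left <- (b2 - 15) / seb2
p_left <- pt(t_left, df)

# Two-tail test, c = 0: both tails beyond |t|
t_two <- (b2 - 0) / seb2
p_two <- 2 * (1 - pt(abs(t_two), df))

print(c(right = p_right, left = p_left, two_tail = p_two))
```

In each case the null hypothesis is rejected at α = 0.05 because the p-value is smaller than 0.05.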

R gives the p-values in the standard regression output, which we can retrieve using the summary(model) function. The Table below shows the output of the regression model, where the p-values can be observed.

Table 3.4: Regression output showing p-values

Testing Linear Combinations of Parameters

Sometimes we wish to estimate the expected value of the dependent variable, y, for a given value of x. For example, according to our food model, what is the average expenditure of a household having income of $2000? We need to estimate the linear combination of the regression coefficients β1 and β2 given in Equation 3.7 (let’s denote the linear combination by L).

L = E(food_exp|income = 20) = β1 + 20β2 (3.7)

Finding confidence intervals and testing hypotheses about the linear combination in Equation 3.7 requires calculating a t-statistic similar to the one for the regression coefficients we calculated before. However, estimating the standard error of the linear combination is not as straightforward. In general, if X and Y are two random variables and a and b two constants, the variance of the linear combination aX + bY is

var(aX + bY) = a²var(X) + b²var(Y) + 2ab cov(X, Y). (3.8)

Applying the formula in Equation 3.8 to the linear combination of β1 and β2 given by Equation 3.7, we obtain Equation 3.9.

var(b1 + 20b2) = var(b1) + 20² var(b2) + 2 × 20 cov(b1, b2) (3.9)

The following sequence of code determines an interval estimate for the expected value of food expenditure in a household earning $2000 a week.
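The steps implied by Equations 3.7 to 3.9 can be sketched as follows. This is an illustration on simulated data, not the handbook's food dataset: the data frame sim, the coefficients used to generate it, and every variable name are assumptions made here so that the example runs on its own. With the real data, fit would be replaced by the food model.

```r
# Simulated stand-in for the food data (hypothetical values)
set.seed(1)
n        <- 40
income   <- runif(n, 5, 35)                        # income in units of $100
food_exp <- 80 + 10 * income + rnorm(n, sd = 90)   # weekly food expenditure
sim      <- data.frame(income, food_exp)
fit      <- lm(food_exp ~ income, data = sim)

b  <- coef(fit)
V  <- vcov(fit)    # variance-covariance matrix of (b1, b2)
x0 <- 20           # income of $2000

L_hat <- b[[1]] + x0 * b[[2]]                            # Equation 3.7
var_L <- V[1, 1] + x0^2 * V[2, 2] + 2 * x0 * V[1, 2]     # Equation 3.9
se_L  <- sqrt(var_L)
tcr   <- qt(0.975, df.residual(fit))

ci <- c(lower = L_hat - tcr * se_L, upper = L_hat + tcr * se_L)
print(ci)
```

The interval printed here belongs to the simulated data. It matches what predict(fit, newdata = data.frame(income = 20), interval = "confidence") returns, which is a convenient way to check the manual calculation.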


Figure 3.5: p−Values for positive and negative t as calculated using the formula 1 − pt(t,df)

The result is the confidence interval (258.91, 316.31). Next, we test hypotheses about the linear combination L defined in Equation 3.7, looking at the three types of hypotheses: two-tail, left-tail, and right-tail. Equations 3.10 − 3.12 show the test setups for a hypothesized value of food expenditure c.

One should use the function pt(t, df) carefully when testing hypotheses with the p-value method, because the formula 1-pt(t, df) gives the wrong result when the calculated t is negative. Therefore, the absolute value of t should be used. Figure 3.5 shows the p-values calculated with the formula 1-pt(t, df). When t is positive and the test is two-tail, doubling 1-pt(t, df) is correct; but when t is negative, the correct p-value is 2*pt(t, df).

# .shadenorm() is a user-defined plotting helper, not a base R function;
# it has to be defined or sourced before these lines are run
.shadenorm(above=1.6, justabove=TRUE)
segments(1.6, 0, 1.6, 0.2, col="blue", lty=3)
legend("topleft", legend="t", col="blue", lty=3)

.shadenorm(above=-1.6, justabove=TRUE)
segments(-1.6, 0, -1.6, 0.2, col="blue", lty=3)
legend("topleft", legend="t", col="blue", lty=3)

The next sequence uses the values already calculated before, a hypothesized level of food expenditure c = $250, and an income of $2000; it tests the two-tail hypothesis in Equation 3.10 first using the “critical t” method, then using the p-value method.

The results are: t = 2.65, tcr = 2.02, and p = 0.0116. Since t > tcr, we reject the null hypothesis. The same result is given by the p-value method, where the p-value is twice the probability area determined by the calculated t.
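The warning about negative t values is easy to check numerically. The sketch below takes t = 2.65 and df = 38 from the results above; the helper name two_tail_p is an illustrative choice, not a function from any package.

```r
df <- 38

# Two-tail p-value that works for either sign of t
two_tail_p <- function(t_stat, df) 2 * (1 - pt(abs(t_stat), df))

p_pos <- two_tail_p(2.65, df)     # positive t
p_neg <- two_tail_p(-2.65, df)    # negative t gives the same p-value

# What goes wrong without abs(): the "p-value" exceeds 1
naive_neg <- 2 * (1 - pt(-2.65, df))

print(c(p_pos = p_pos, p_neg = p_neg, naive_neg = naive_neg))
```

For a negative t, 2 * pt(t, df) gives the same answer as the version with abs(), because the t distribution is symmetric about zero.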

Lab 3 - Prediction, R-squared, and Modelling

rm(list=ls()) # Caution: this clears the Environment

A prediction is an estimate of the value of y for a given value of x, based on a regression model of the form shown in Equation 4.1. Goodness-of-fit is a measure of how well an estimated regression line approximates the data in a given sample. One such measure is the correlation coefficient between the predicted values of y for all x-s in the data file and the actual y-s. Goodness-of-fit, along with other diagnostic tests, helps to determine the most suitable functional form of our regression equation, i.e., the most suitable mathematical relationship between y and x.

yi = β1 + β2xi + ei (4.1)

Forecasting (Predicting a Particular Value)

Assuming that the expected value of the error term in Equation 4.1 is zero, Equation 4.2 gives ŷi, the predicted value of the expectation of yi given xi, where b1 and b2 are the (least squares) estimates of the regression parameters β1 and β2.

ŷi = b1 + b2xi (4.2)

The predicted value ŷi is a random variable, since it depends on the sample; therefore, we can calculate a confidence interval and test hypotheses about it, provided we can determine its distribution and variance. The prediction has a normal distribution, being a linear combination of two normally distributed random variables b1 and b2, and its variance is the one given by Equation 3.9, with the chosen x in place of 20. Please note that the variance in Equation 4.3 is not the same as the one in Equation 3.9; the latter is the variance of the estimated expectation of y, while the former is the variance of the error made in forecasting a particular occurrence of y. Let us call the former the variance of the forecast error. Not surprisingly, the variance of the forecast error is greater than the variance of the predicted E(y|x).

As before, since we need to use an estimated variance, we use a t-distribution instead of a normal one. Equation 4.3 applies to any given x, say x0, not only to those x-s in the dataset.

var(fi) = σ² [1 + 1/N + (xi − x̄)² / Σj (xj − x̄)²], j = 1, …, N, (4.3)

which can be reduced to

var̂(fi) = σ̂² + σ̂²/N + (xi − x̄)² var̂(b2) (4.4)

Let’s determine a standard error for the food equation for a household earning $2000 a week, i.e., at x = x0 = 20, using Equation 4.4; to do so, we need to retrieve var(b2) and σˆ, the standard error of regression from the regression output.


The result is the confidence interval for the forecast (104.13,471.09), which is, as expected, larger than the confidence interval of the estimated expected value of y based on Equation 3.9.
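The calculation based on Equation 4.4 can be sketched as below. The data are simulated here so that the example is self-contained; the generating coefficients and all variable names are assumptions, and the numbers printed will not match the food results quoted above.

```r
# Simulated stand-in for the food data (hypothetical values)
set.seed(2)
n        <- 40
income   <- runif(n, 5, 35)                        # income in units of $100
food_exp <- 80 + 10 * income + rnorm(n, sd = 90)
fit      <- lm(food_exp ~ income)

x0      <- 20
sighat2 <- summary(fit)$sigma^2      # estimated error variance
varb2   <- vcov(fit)[2, 2]           # estimated variance of b2
xbar    <- mean(income)

# Estimated variance of the forecast error, Equation 4.4
varf <- sighat2 + sighat2 / n + (x0 - xbar)^2 * varb2
sef  <- sqrt(varf)

yhat0 <- sum(coef(fit) * c(1, x0))   # point forecast b1 + b2 * x0
tcr   <- qt(0.975, df.residual(fit))
fi    <- c(lower = yhat0 - tcr * sef, upper = yhat0 + tcr * sef)
print(fi)
```

The same limits are returned by predict() with interval = "prediction", so the manual steps mainly serve to show where the extra σ̂² term enters.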

Let us calculate confidence intervals of the forecast for all the observations in the sample and draw the upper and lower limits together with the regression line. Figure 4.1 shows the confidence interval band about the regression line.

A different way of finding point and interval estimates for the predicted E(y|x) and forecasted y (please see the distinction I mentioned above) is to use the predict() function in R. This function requires that the values of the independent variable where the prediction (or forecast) is intended have a data frame structure. The next example shows in parallel point and interval estimates of predicted and forecasted food expenditures for an income of $2000. As I have pointed out before, the point estimate is the same for both prediction and forecast, but the interval estimates are very different.

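m1 <- lm(food_exp ~ income, data = food) # food model; reload food with data("food") after rm(list=ls())
incomex <- data.frame(income=20)
predict(m1, newdata=incomex, interval="confidence", level=0.95) # interval for the predicted E(y|x)
predict(m1, newdata=incomex, interval="prediction", level=0.95) # interval for the forecasted y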

Figure 4.1: Forecast confidence intervals for the food simple regression

Let us now use the predict() function to replicate Figure 4.1. The result is Figure 4.2, which shows, besides the interval estimation band, the points in the dataset. (I will create new values for income just for the purpose of plotting.)


Figure 4.2: Predicted and forecasted bands for the food dataset

Figure 4.2 presents the predicted and forecasted bands on the same graph, to show that they have the same point estimates (the black, solid line) and that the forecasted band is much larger than the predicted one. Put another way, you may think about the distinction between the two types of intervals that we called prediction and forecast as follows: the prediction interval is not supposed to include, say, 95 percent of the points, but to include the regression line, E(y|x), with a probability of 95 percent; the forecasted interval, on the other hand, should include any true point with a 95 percent probability.

4.2 Goodness-of-Fit

The total variation of y about its sample mean, SST, can be decomposed in variation about the regression line, SSE, and variation of the regression line about the mean of y, SSR, as Equation 4.5 shows.

SST = SSR + SSE (4.5)

The coefficient of determination, R2, is defined as the proportion of the variance in y that is explained by the regression, SSR, in the total variation in y, SST. Dividing both sides of the Equation 4.5 by SST and re-arranging terms gives a formula to calculate R2, as shown in Equation 4.6.

R² = SSR/SST = 1 − SSE/SST (4.6)

R² takes values between 0 and 1, with higher values showing a closer fit of the regression line to the data. In R, the value of R² can be retrieved from the summary of the regression model under the name r.squared; for instance, in our food example, R² = 0.385. R² is also printed as part of the summary of a regression model, as the following code sequence shows. (The parentheses around a command tell R to print the result.)

sm1 <- summary(m1) # summary of the food model m1
(rsq <- sm1$r.squared) # or
## [1] 0.385002
sm1 # prints the summary of regression model m1
##
## Call:
## lm(formula = food_exp ~ income, data = food)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -223.03  -50.82   -6.32   67.88  212.04
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    83.42      43.41    1.92    0.062 .
## income         10.21       2.09    4.88 0.000019 ***
##
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89.5 on 38 degrees of freedom
## Multiple R-squared:  0.385,  Adjusted R-squared:  0.369
## F-statistic: 23.8 on 1 and 38 DF,  p-value: 0.0000195

If you need the sum of squared errors, SSE, or the sum of squares due to regression, SSR, use the anova function, which has the structure shown in Table 4.1.

anov <- anova(m1)
dfr <- data.frame(anov)
kable(dfr, caption="Output generated by the anova function")

Table 4.1 indicates that SSE = anov[2,2] = 3.045052 × 10⁵, SSR = anov[1,2] = 1.90627 × 10⁵, and SST = anov[1,2]+anov[2,2] = 4.951322 × 10⁵. In our simple regression model, the sum of squares due to regression only includes the variable income. In multiple regression models, which are models with more than one independent variable, the sum of squares due to regression is equal to the sum of squares due to all independent variables. The anova results in Table 4.1 include other useful information: the number of degrees of freedom, anov[2,1], and the estimated variance σ̂² = anov[2,3].

Table 4.1: Output generated by the 'anova' function
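To see how the entries of the anova table connect to Equations 4.5 and 4.6, here is a small sketch on simulated data. The data and all names are hypothetical; the indexing anov[1,2], anov[2,2] and anov[2,3] follows the description above.

```r
# Simulated data (hypothetical values), one regressor
set.seed(3)
x   <- runif(40, 5, 35)
y   <- 80 + 10 * x + rnorm(40, sd = 90)
fit <- lm(y ~ x)

anov <- anova(fit)
SSR  <- anov[1, 2]     # sum of squares due to regression
SSE  <- anov[2, 2]     # sum of squared errors
SST  <- SSR + SSE      # Equation 4.5

R2         <- SSR / SST      # Equation 4.6
sigma2_hat <- anov[2, 3]     # estimated error variance, SSE / (N - 2)

print(c(SSR = SSR, SSE = SSE, SST = SST, R2 = R2, sigma2_hat = sigma2_hat))
```

The value of R2 obtained this way equals summary(fit)$r.squared, and sigma2_hat equals the square of the residual standard error.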

Lab 4 Linear-Log Models

Non-linear functional forms of regression models are useful when the relationship between two variables seems to be more complex than the linear one. One can decide to use a non-linear functional form based on a mathematical model, reasoning, or simply inspecting a scatter plot of the data. In the food expenditure model, for example, it is reasonable to believe that the amount spent on food increases faster at lower incomes than at higher incomes. In other words, it increases at a decreasing rate, which makes the regression curve flatten out at higher incomes.

What function could one use to model such a relationship? The logarithmic function fits this profile and, as it turns out, it is relatively easy to interpret, which makes it very popular in econometric models. The general form of a linear-log econometric model is provided in Equation 4.7.

yi = β1 + β2log(xi) + ei (4.7)

The marginal effect of a change in x on y is the slope of the regression curve and is given by Equation 4.8; unlike in the linear form, it depends on x and it is, therefore, only valid for small changes in x.

dy/dx = β2/x (4.8)

Related to the linear-log model, another measure of interest in economics is the semi-elasticity of y with respect to x, which is given by Equation 4.9. Semi-elasticity suggests that a change in x of 1% changes y by β2/100 units of y. Since semi-elasticity also changes when x changes, it should only be determined for small changes in x.

Table 4.2: Linear-log model output for the food example

dy = (β2/100)(%∆x) (4.9)

Another quantity that might be of interest is the elasticity of y with respect to x, which is given by Equation 4.10 and indicates that a one percent increase in x produces a (β2/y) percent change in y.

%∆y = (β2/y)(%∆x) (4.10)

Let us estimate a linear-log model for the food dataset, draw the regression curve, and calculate the marginal effects for some given values of the independent variable.

mod2 <- lm(food_exp~log(income), data=food)
tbl <- data.frame(xtable(mod2))
kable(tbl, digits=5, caption="Linear-log model output for the food example")

The results for an income of $1000 are as follows: dy/dx = 13.217, which indicates that an increase in income of $100 (i.e., one unit of x) increases expenditure by $13.217; for a 1% increase in income, that is, an increase of $10, expenditure increases by $1.322; and, finally, for a 1% increase in income, expenditure increases by 0.638%.

Figure 4.3: Linear-log representation for the food data
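The three quantities in Equations 4.8 to 4.10 follow directly from the estimated coefficients. The sketch below uses simulated data, so the coefficients and results are hypothetical and differ from the food example; the names DYDX, DYPDX and PDYPDX are illustrative.

```r
# Simulated linear-log data (hypothetical values)
set.seed(4)
income   <- runif(40, 4, 34)                             # in units of $100
food_exp <- -100 + 130 * log(income) + rnorm(40, sd = 90)
fit      <- lm(food_exp ~ log(income))

b1 <- coef(fit)[[1]]
b2 <- coef(fit)[[2]]
x0 <- 10                          # income of $1000
y0 <- b1 + b2 * log(x0)           # predicted expenditure at x0

DYDX   <- b2 / x0     # marginal effect, Equation 4.8
DYPDX  <- b2 / 100    # change in y for a 1% change in x, Equation 4.9
PDYPDX <- b2 / y0     # percent change in y for a 1% change in x, Equation 4.10

print(c(DYDX = DYDX, DYPDX = DYPDX, PDYPDX = PDYPDX))
```

Note that only DYPDX is the same at every income level; the marginal effect and the elasticity both change with x0.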

4.4 Residuals and Diagnostics

Regression results are reliable only to the extent to which the underlying assumptions are met. Plotting the residuals and calculating certain test statistics help to decide whether assumptions such as homoskedasticity, absence of serial correlation, and normality of the errors are violated. In R, the residuals are stored in the component residuals of the regression output.

ehat <- mod2$residuals
plot(food$income, ehat, xlab="income", ylab="residuals")

Figure 4.4 shows the residuals of the linear-log equation of the food expenditure example. One can notice that the spread of the residuals seems to be higher at higher incomes, which may indicate that the homoskedasticity assumption is violated.

Let us draw a residual plot generated with a simulated model that satisfies the regression assumptions. The data generating process is given by Equation 4.11, where x is a number between 0 and 10, randomly drawn from a uniform distribution, and the error term is randomly drawn from a standard normal distribution. Figure 4.5 illustrates this simulated example.

Figure 4.4: Residual plot for the food linear-log model

yi = 1 + xi + ei, i = 1,…,N (4.11)

set.seed(12345) # sets the seed for the random number generator
x <- runif(300, 0, 10)
e <- rnorm(300, 0, 1)
y <- 1+x+e
mod3 <- lm(y~x)
ehat <- resid(mod3)
plot(x, ehat, xlab="x", ylab="residuals")

The next example illustrates what the residuals look like when a linear functional form is used although the true relationship is, in fact, quadratic. The data generating equation is given in Equation 4.12, where x is uniformly distributed between −2.5 and 2.5, and e ∼ N(0,4). Figure 4.6 shows the residuals from estimating an incorrectly specified, linear econometric model when the correct specification should be quadratic.

yi = 15 − 4xi² + ei, i = 1,…,N (4.12)
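A simulation along the lines of Equation 4.12 might look as follows. Two choices here are assumptions rather than something the text fixes: the sample size N = 1000, and reading N(0,4) as a variance of 4, i.e. a standard deviation of 2. The object names are illustrative.

```r
set.seed(12345)
N <- 1000                          # assumed sample size
x <- runif(N, -2.5, 2.5)
e <- rnorm(N, mean = 0, sd = 2)    # assumption: N(0, 4) means variance 4
y <- 15 - 4 * x^2 + e              # Equation 4.12

mod_lin  <- lm(y ~ x)              # misspecified: linear in x
ehat_lin <- resid(mod_lin)
plot(x, ehat_lin, xlab = "x", ylab = "residuals")

mod_quad <- lm(y ~ I(x^2))         # correctly specified alternative

print(c(R2_linear    = summary(mod_lin)$r.squared,
        R2_quadratic = summary(mod_quad)$r.squared))
```

The residual plot of the linear model shows an inverted-U pattern instead of a shapeless cloud, which is the visual sign of the wrong functional form; the quadratic model also fits far better.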


Figure 4.5: Residuals generated by a simulated regression model that satisfies the regression assumptions

Another assumption that we would like to test is the normality of the residuals, which assures reliable hypothesis testing and confidence intervals even in small samples. This assumption can be assessed by inspecting a histogram of the residuals, as well as performing a Jarque-Bera test, for which the null hypothesis is “Series is normally distributed”. Thus, a small p-value rejects the null hypothesis, which means the series fails the normality test. The Jarque-Bera test requires installing and loading the package tseries in R. Figure 4.7 shows a histogram and a superimposed normal distribution for the linear food expenditure model.

Figure 4.6: Simulated quadratic residuals from an incorrectly specified econometric model

library(tseries) # provides jarque.bera.test()
ehat <- resid(m1) # residuals of the linear food model
jarque.bera.test(ehat)
##
##  Jarque Bera Test
##
## data:  ehat
## X-squared = 0.06334, df = 2, p-value = 0.969

While the histogram in Figure 4.7 may not strongly support one conclusion or another about the normality of ehat, the Jarque-Bera test is unambiguous: there is no evidence against the normality hypothesis.
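To see what the Jarque-Bera test measures, the statistic can be computed by hand from the skewness S and kurtosis K of the residuals, JB = (N/6)[S² + (K − 3)²/4], and compared with a chi-squared distribution with 2 degrees of freedom. The function jb_stat below is a hypothetical helper written for this illustration in base R; it is a sketch of the standard formula, not a replacement for the tseries function.

```r
# Jarque-Bera statistic from sample moments (illustrative helper)
jb_stat <- function(e) {
  n  <- length(e)
  d  <- e - mean(e)
  m2 <- mean(d^2)               # second central moment
  S  <- mean(d^3) / m2^1.5      # skewness
  K  <- mean(d^4) / m2^2        # kurtosis (3 for a normal distribution)
  JB <- n / 6 * (S^2 + (K - 3)^2 / 4)
  c(JB = JB, p.value = 1 - pchisq(JB, df = 2))
}

# Example: residual-like values drawn from a normal distribution
set.seed(5)
res <- jb_stat(rnorm(200))
print(res)
```

A large p.value, as in the food example, means that skewness and kurtosis are close enough to their normal values that normality is not rejected.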


Figure 4.7: Histogram of residuals from the food linear model

Lab 5- Polynomial Models

Regression models may include quadratic or cubic terms to better describe the nature of the data. The following is an example of quadratic and cubic models for the wa_wheat dataset, which gives annual wheat yield in tonnes per hectare in Greenough Shire in Western Australia over a period of 48 years. The linear model is given in Equation 4.13, where the subscript t indicates the observation period.

yieldt = β1 + β2timet + et (4.13)

Figure 4.8 shows a pattern in the residuals generated by the linear model, which may inspire us to think of a more appropriate functional form, such as the one in Equation 4.14.

yieldt = β1 + β2timet³ + et (4.14)

Figure 4.8: Residuals from the linear wheatyield model

Please note in the following code sequence the use of the function I(), which is needed in R when an independent variable is transformed by mathematical operators. You do not need the function I() when an independent variable is transformed through a function such as log(x). In our example, the transformation requiring the use of I() is raising time to the power of 3. Of course, you can create a new variable, x3 = x^3, if you wish to avoid the use of I() in a regression equation.

data("wa_wheat") # from the PoEdata package
mod2 <- lm(greenough~I(time^3), data=wa_wheat)
ehat <- resid(mod2)
plot(wa_wheat$time, ehat, xlab="time", ylab="residuals")

Figure 4.9 displays a much better image of the residuals than Figure 4.8, since the residuals are more evenly spread about the zero line.
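The role of I() is easiest to see by comparing three ways of writing the cubic model. The sketch below uses a simulated yield series; the data frame wheat and its coefficients are hypothetical stand-ins, not the wa_wheat data.

```r
# Simulated yield series over 48 periods (hypothetical values)
set.seed(6)
time  <- 1:48
yield <- 0.9 + 1e-5 * time^3 + rnorm(48, sd = 0.2)
wheat <- data.frame(time, yield)

# 1. Cubic term protected by I()
fit_I <- lm(yield ~ I(time^3), data = wheat)

# 2. Cubic term created as a new variable beforehand
wheat$time3 <- wheat$time^3
fit_new <- lm(yield ~ time3, data = wheat)

# 3. Without I(): inside a formula ^ is a formula operator,
#    so this is silently the same as yield ~ time
fit_noI <- lm(yield ~ time^3, data = wheat)

print(coef(fit_I))
print(coef(fit_new))
print(coef(fit_noI))
```

The first two fits give identical estimates, while the third one reports a coefficient on time itself, which is why forgetting I() is a common and hard-to-spot mistake.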

4.6 Log-Linear Models

Transforming the dependent variable with the log() function is useful when the variable has a skewed distribution, which is in general the case with amounts that cannot be negative. The log() transformation often makes the distribution closer to normal. The general log-linear model is given in Equation 4.15.


Figure 4.9: Residuals from the cubic wheatyield model

log(yi) = β1 + β2xi + ei (4.15)

The following formulas are easily derived from the log-linear Equation 4.15. The semi-elasticity has here a different interpretation than the one in the linear-log model: here, an increase in x by one unit (of x) produces a change of 100b2 percent in y. For small changes in x, the amount 100b2 in the log-linear model can also be interpreted as the growth rate in y (corresponding to a unit increase in x). For instance, if x is time, then 100b2 is the growth rate in y per unit of time.

Prediction: ŷn = exp(b1 + b2x), or ŷc = exp(b1 + b2x + σ̂²/2), with the “natural” predictor ŷn to be used in small samples and the “corrected” predictor, ŷc, in large samples

Marginal effect (slope): dy/dx = b2y

Semi-elasticity: %∆y = 100b2∆x

Let us do these calculations first for the yield equation using the wa_wheat dataset.

Table 4.3 gives b2 = 0.017844, which indicates that wheat production has grown at an average rate of approximately 1.78 percent per year.

Table 4.3: Log-linear model for the yield equation

Table 4.4: Log-linear 'wage' regression output

The wage log-linear equation provides another example of calculating a growth rate, but this time the independent variable is not time, but education. The predictions and the slope are calculated for educ = 12 years.

Here are the results of these calculations: “natural” prediction ŷn = 14.796; corrected prediction, ŷc = 16.996; growth rate g = 9.041; and marginal effect dy/dx = 1.34. The growth rate indicates that an increase in education by one unit (see the data description using ?cps4_small) increases hourly wage by 9.041 percent.
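The calculations behind these four numbers can be sketched as follows. The wage data are simulated here, so the results are hypothetical and will not reproduce the cps4_small figures; the variable names mirror the ones used elsewhere in this lab.

```r
# Simulated log-linear wage data (hypothetical values)
set.seed(7)
n    <- 500
educ <- sample(8:20, n, replace = TRUE)
wage <- exp(1.6 + 0.09 * educ + rnorm(n, sd = 0.5))
fit  <- lm(log(wage) ~ educ)

b1      <- coef(fit)[[1]]
b2      <- coef(fit)[[2]]
sighat2 <- summary(fit)$sigma^2          # estimated error variance

x0 <- 12                                 # years of education
yn <- exp(b1 + b2 * x0)                  # "natural" predictor
yc <- exp(b1 + b2 * x0 + sighat2 / 2)    # "corrected" predictor
g    <- 100 * b2                         # growth rate, percent per year of education
DYDX <- b2 * yn                          # marginal effect at x0

print(c(yn = yn, yc = yc, g = g, DYDX = DYDX))
```

The corrected predictor is always the larger of the two, because it multiplies the natural one by exp(σ̂²/2), which exceeds 1.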

Figure 4.10 presents the “natural” and the “corrected” regression lines for the wage equation, together with the actual data points.

education <- seq(0, 22, 2)
yn <- exp(b1+b2*education)
yc <- exp(b1+b2*education+sighat2/2)
plot(cps4_small$educ, cps4_small$wage, xlab="education", ylab="wage", col="grey")
lines(yn~education, lty=2, col="black")
lines(yc~education, lty=1, col="blue")
legend("topleft", legend=c("yc","yn"), lty=c(1,2), col=c("blue","black"))

Figure 4.10: The 'natural' and 'corrected' regression lines in the log-linear wage equation

The regular R² cannot be used to compare two regression models having different dependent variables, such as a linear-log and a log-linear model; when such a comparison is needed, one can use the generalized R², which is Rg² = [corr(y, ŷ)]². Let us calculate the generalized R² for the quadratic and the log-linear wage models.

rg5 <- cor(cps4_small$wage,yhat5)^2

The quadratic model yields Rg² = 0.188, and the log-linear model yields Rg² = 0.186; since the former is higher, we conclude that the quadratic model is a better fit to the data than the log-linear one. (However, other tests of how the two models meet the assumptions of linear regression may reach a different conclusion; R² is only one of the model selection criteria.)
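A comparison of this kind can be sketched on simulated data. The wage data below are hypothetical, and the quadratic specification wage ~ I(educ^2) is an assumption about what "the quadratic model" means here; the key point is that both sets of fitted values are put on the wage scale before being correlated with wage.

```r
# Simulated wage data (hypothetical values)
set.seed(8)
n    <- 500
educ <- sample(8:20, n, replace = TRUE)
wage <- exp(1.6 + 0.09 * educ + rnorm(n, sd = 0.5))

fit_quad <- lm(wage ~ I(educ^2))     # assumed quadratic specification
fit_log  <- lm(log(wage) ~ educ)     # log-linear specification

yhat_quad <- fitted(fit_quad)        # already on the wage scale
yhat_log  <- exp(fitted(fit_log))    # natural predictor, back on the wage scale

rg_quad <- cor(wage, yhat_quad)^2    # generalized R-squared, quadratic model
rg_log  <- cor(wage, yhat_log)^2     # generalized R-squared, log-linear model

print(c(quadratic = rg_quad, log_linear = rg_log))
```

For a model estimated by least squares with an intercept and wage itself as the dependent variable, the generalized R² coincides with the regular R², so the new measure matters only for the log-linear model.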

To determine a forecast interval estimate in the log-linear model, we first construct the interval in logs using the natural predictor ŷn, then take antilogs of the interval limits. The forecasting error is the same as before, given in Equation 4.4. The following calculations use an education level equal to 12 and α = 0.05.

The result is the confidence interval (5.26, 41.62). Figure 4.11 shows a 95% confidence band for the log-linear wage model.


Figure 4.11: Confidence band for the log-linear wage equation

4.7 The Log-Log Model

The log-log model has the desirable property that the coefficient of the independent variable is equal to the (constant) elasticity of y with respect to x. Therefore, this model is often used to estimate supply and demand equations. Its standard form is given in Equation 4.16, where y, x, and e are N × 1 vectors.

log(y) = β1 + β2log(x) + e (4.16)

Table 4.5: The log-log poultry regression equation

Table 4.5 gives the log-log regression output. The coefficient on p indicates that an increase in price by 1% changes the quantity demanded by −1.121%.

# The generalized R-squared:
rgsq <- cor(newbroiler$q, yhatc)^2

The generalized R², which uses the corrected fitted values, is equal to 0.8818.


Figure 4.12: Log-log demand for chicken

Lab 6 – Hands-on on your assignment

References

Adkins, Lee. 2014. Using Gretl for Principles of Econometrics, 4th Edition. Economics Working Paper Series. 1412. Oklahoma State University, Department of Economics; Legal Studies in Business.

Allaire, JJ, Joe Cheng, Yihui Xie, Jonathan McPherson, Winston Chang, Jeff Allen, Hadley Wickham, Aron Atkins, and Rob Hyndman. 2016. Rmarkdown: Dynamic Documents for R.

Colonescu, Constantin. 2016. PoEdata: PoE Data for R.

Croissant, Yves, and Giovanni Millo. 2015. Plm: Linear Models for Panel Data.

Dahl, David B. 2016. Xtable: Export Tables to Latex or Html.

Fox, John, and Sanford Weisberg. 2016. Car: Companion to Applied Regression.

Fox, John, Sanford Weisberg, Michael Friendly, and Jangman Hong. 2016. Effects: Effect Displays for Linear, Generalized Linear, and Other Models.

Ghalanos, Alexios. 2015. Rugarch: Univariate Garch Models.

Graves, Spencer. 2014. FinTS: Companion to Tsay (2005) Analysis of Financial Time Series.

Grolemund, Garrett, and Hadley Wickham. 2016. R for Data Science. Online book.

Henningsen, Arne, and Jeff D. Hamann. 2015. Systemfit: Estimating Systems of Simultaneous Equations.

Hill, R.C., W.E. Griffiths, and G.C. Lim. 2011. Principles of Econometrics. Wiley.


Hlavac, Marek. 2015. Stargazer: Well-Formatted Regression and Summary Statistics Tables.

Hothorn, Torsten, Achim Zeileis, Richard W. Farebrother, and Clint Cummins. 2015. Lmtest: Testing Linear Regression Models.

Hyndman, Rob. 2016. Forecast: Forecasting Functions for Time Series and Linear Models.

Kleiber, Christian, and Achim Zeileis. 2015. AER: Applied Econometrics with R.

Komashko, Oleh. 2016. NlWaldTest: Wald Test of Nonlinear Restrictions and Nonlinear Ci.

Lander, Jared P. 2013. R for Everyone: Advanced Analytics and Graphics. 1st ed. Addison-Wesley Professional.

Lumley, Thomas, and Achim Zeileis. 2015. Sandwich: Robust Covariance Matrix Estimators.

Pfaff, Bernhard. 2013. Vars: VAR Modelling.

R Development Core Team. 2008. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Reinhart, Abiel. 2015. Pdfetch: Fetch Economic and Financial Time Series Data from Public Sources.

Robinson, David. 2016. Broom: Convert Statistical Analysis Objects into Tidy Data Frames.

RStudio Team. 2015. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc.

Spada, Stefano, Matteo Quartagno, and Marco Tamburini. 2012. Orcutt: Estimate Procedure in Case of First Order Autocorrelation.

Trapletti, Adrian, and Kurt Hornik. 2016. Tseries: Time Series Analysis and Computational Finance.

Wickham, Hadley, and Winston Chang. 2016. Devtools: Tools to Make Developing R Packages Easier.

Xie, Yihui. 2014. Printr: Automatically Print R Objects According to Knitr Output Format.

———. 2016a. Bookdown: Authoring Books with R Markdown.

———. 2016b. Knitr: A General-Purpose Package for Dynamic Report Generation in R.

Zeileis, Achim. 2016. Dynlm: Dynamic Linear Regression.